Scalability and Validation of Big Data Bioinformatics Software

Authors

  • Andrian Yang
  • Michael Troup
  • Joshua W.K. Ho
Abstract

This review examines two aspects that are central to modern big data bioinformatics analysis: software scalability and validity. We argue that not only are the issues of scalability and validation common to all big data bioinformatics analyses, but they can also be tackled by conceptually related methodological approaches, namely divide-and-conquer (scalability) and multiple executions (validation). Scalability is the ability of a program to scale with its workload. It has always been an important consideration when developing bioinformatics algorithms and programs. Nonetheless, the surge in the volume and variety of biological and biomedical data has posed new challenges. We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to implement divide-and-conquer effectively in a distributed computing environment. Validation of software is another important issue in big data bioinformatics that is often ignored. Software validation is the process of determining whether the program under test fulfils the task for which it was designed. Determining the correctness of the computational output of big data bioinformatics software is especially difficult because of the large input space and the complex algorithms involved. We discuss how state-of-the-art software testing techniques based on the idea of multiple executions, such as metamorphic testing, can be used to implement an effective bioinformatics quality assurance strategy. We hope this review will raise awareness of these critical issues in bioinformatics.
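
The two ideas named in the abstract can be illustrated with a small sketch. It is not taken from the review: the k-mer counting task, the function names, and the two metamorphic relations are assumptions chosen for illustration, and the divide-and-conquer step runs in a single process rather than on MapReduce or Spark. The counting function splits the reads into chunks, counts each chunk independently, and merges the partial tables; the two checks then run the program several times on related inputs and compare the outputs, which is the core of metamorphic testing.

```python
from collections import Counter
from functools import reduce
import random


def count_kmers(read, k):
    """Map step (hypothetical helper): count the k-mers in one read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def merge_counts(a, b):
    """Reduce step: combine two partial k-mer tables."""
    return a + b


def kmer_table(reads, k=3, n_chunks=4):
    """Divide the reads into chunks, count each chunk, then merge."""
    chunks = [reads[i::n_chunks] for i in range(n_chunks)]
    partials = [
        reduce(merge_counts, (count_kmers(r, k) for r in chunk), Counter())
        for chunk in chunks
    ]
    return reduce(merge_counts, partials, Counter())


def check_permutation_relation(reads, k=3, seed=0):
    """Metamorphic relation (assumed): shuffling reads leaves the table unchanged."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    return kmer_table(reads, k) == kmer_table(shuffled, k)


def check_duplication_relation(reads, k=3):
    """Metamorphic relation (assumed): doubling the input doubles every count."""
    base = kmer_table(reads, k)
    doubled = kmer_table(reads + reads, k)
    same_keys = set(base) == set(doubled)
    return same_keys and all(doubled[kmer] == 2 * n for kmer, n in base.items())


# Toy input, invented for this example.
reads = ["ACGTAC", "GTACGT", "TTACGA", "ACGACG", "CGTACG"]
print(check_permutation_relation(reads), check_duplication_relation(reads))
```

Neither check needs to know the correct k-mer table in advance. Each one only compares the outputs of two executions, which is why this style of testing suits programs whose expected output is hard to compute independently.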

Similar resources

GenoMetric Query Language: a novel approach to large-scale genomic data management

MOTIVATION Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art 'big data' computing strategies, with abstraction lev...

Spark Scalability Analysis in a Scientific Workflow

Spark is being successfully used for big data parallel processing in many business domains (social media, finance, retail). Spark’s scalability, usability, and large user community have motivated developers from scientific domains (bioinformatics, oil and gas, astronomy) to try it. However, scientific applications’ profile, e.g., black-box programs and intense file writes, differs from traditio...

Impact of Biological Big Data in Bioinformatics

In an era of high-throughput sequencing, drilling of biological data to extract hidden valued information plays an important role in making critical decisions across every branch of science, whether it is genomics or proteomics or metabolomics or personal medicine. For example, the genome sequence of the patients contains much valuable information about the myriad of disease causes, easy extrac...

Genome analysis GenoMetric Query Language: a novel approach to large-scale genomic data management

Motivation: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art ‘big data’ computing strategies, with abstraction le...

Classification in Data Mining and Analysis of Fuzzy Based Methods –big Data Analytics

Data mining comprises various applications that specifically includes biological and biomedicine data which continues to be research oriented task in the bioinformatics field. In today’s world big data applications are mainly focused because of enormous increase in data and also its storage. This leads to major problem when extracting knowledge or information by processing huge amount of data. ...

Journal title:

Volume 15, Issue

Pages -

Publication date: 2017